1 Introduction

This document can be used to replicate the results from the manuscript on confounders. First, we provide a small example that serves as a simple case to build intuition. Second, we provide a larger example where we conduct a sensitivity analysis of unobserved/unknown confounders.

2 A first example to build intuition

2.1 The effect of a confounder

In our work we follow the notation of Pearl and claim that confounding is a purely causal concept (Pearl 2009). A confounder (\(C\)) is a variable that causally affects both the treatment (\(X\)) and the outcome (\(Y\)), and when a confounder exists we get a spurious (fake) association.

We need to distinguish between two types of studies common in software engineering: experiments (randomized controlled trials, RCTs) and observational studies. One reason why experiments are considered the gold standard is that confounding is supposed to be, theoretically, a non-issue due to randomization (i.e., the effect of any confounders will be balanced out). So, very simply put, if we randomly allocate the treatment (\(X\)), then the causal effect of a confounder (\(C\)) on \(X\) and \(Y\) should not matter. With observational studies, on the other hand, it was long thought that a causal approach was impossible since the researcher does not control the environment. We now know this is not true, and the Nobel Prizes of 2019 and 2021 are further indications of this.
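A tiny simulation makes the randomization point concrete (a base-R sketch, not part of the manuscript’s analysis; lm() is used for speed and the coefficients are assumed, the same ones as in the fake-data example further below):

```r
# If the confounder drives the treatment (observational), the naive estimate
# is biased; if the treatment is randomized (RCT), the naive estimate is not.
set.seed(1)
n <- 1e5
skills <- rnorm(n) # the confounder C

# Observational: Skills affects which language is used
lang_obs <- 0.5 * skills + rnorm(n)
qual_obs <- 0.3 * skills + 0.4 * lang_obs + rnorm(n)

# RCT: the language is randomly allocated, so Skills has no path into it
lang_rct <- rnorm(n)
qual_rct <- 0.3 * skills + 0.4 * lang_rct + rnorm(n)

coef(lm(qual_obs ~ lang_obs))["lang_obs"] # biased upwards, approx. 0.52
coef(lm(qual_rct ~ lang_rct))["lang_rct"] # close to the true 0.4
```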

If we look at it as software engineers, one possible way to visualize this would be to think about the impact of programming languages (e.g., Python and Java) on code quality. Some people might assume that there’s a strong causal effect of programming language on quality. However, it’s not unreasonable to assume that a software engineer’s skills could have a direct causal effect on which programming language they use (e.g., skilled people use R) and on the final quality of the artefact produced (e.g., skilled people perform better, no matter the programming language).

A graphical summary (a directed acyclic graph) of what we said above can be visualized like this:

library(ggdag) # for dagify(), ggdag(), and theme_dag()

my_dag <- dagify(
  Lang ~ Skills,
  Qual ~ Skills,
  Qual ~ Lang,
  exposure = "Lang", # treatment
  outcome = "Qual",
  coords = list(
    x = c(Lang = 1, Skills = 2, Qual = 3),
    y = c(Lang = 1, Skills = 2, Qual = 1)
  )
)

ggdag(my_dag) + theme_dag()

Figure 2.1: A simple yet realistic DAG.

In the plot above we see that the confounder, Skills, has a causal effect on the treatment, Lang, and on the outcome, Qual. Also, Lang has a direct causal effect on Qual, but that is generally what we’re interested in estimating (i.e., the treatment’s effect on the outcome).

Confounders are scary, but there is no need to panic. If we condition on a confounder (i.e., include it as a predictor in our model) we close the path \(\mathrm{Lang} \leftarrow \mathrm{Skills} \rightarrow \mathrm{Qual}\) and, hence, can obtain an unbiased estimate of the direct effect of Lang on Qual. However, this means that we must have measured Skills, so we can condition on that variable. As we will see later, omitted variable bias (i.e., we do not have access to some variables) is something we can reason about, providing convincing arguments that it likely does not affect the results of a study. But how dangerous is omitted variable bias?

Let’s generate some fake data first. That way we know the truth. For simplicity we’ll assume that Lang is not categorical but rather continuous (which in this case will not have any effect on the end result).

n <- 1e5
skills <- rnorm(n) # exogenous
lang <- 0.5 * skills + rnorm(n) # endogenous
# set language's effect on qual to 0.4
qual <- 0.3 * skills + 0.4 * lang + rnorm(n) # endogenous

d <- data.frame(
  skills = skills,
  lang = lang,
  qual = qual
)

# run two models, one where we don't condition on Skills, and one where we do
# (brm() comes from the brms package)
without_skills <- brm(
  qual ~ lang,
  data = d,
  refresh = 0
)

with_skills <- brm(
  qual ~ lang + skills,
  data = d,
  refresh = 0
)

Disregarding the estimate for our confounder, Skills (we’re seldom interested in the causal effect of a confounder, but there are exceptions), the estimate we have of Lang, i.e., \(\widehat{\mathrm{Lang}}\), is what interests us.

fixef(with_skills)[2, 1]
## [1] 0.3951058
fixef(without_skills)[2, 1]
## [1] 0.5154114

Conditioning on Skills allows us to get the correct estimate for Lang (\(\widehat{\mathrm{Lang}}_s \approx\) 0.4) since we close the backdoor. If we don’t condition on Skills we get a biased (in this case positive) estimate, i.e., \(\widehat{\mathrm{Lang}}_{\neg s} \approx\) 0.52, because of Skills’s effect on both Lang and Qual.

There are a few lessons to learn from the above simple example. First, confounders can have an effect on the estimate. Second, the bias is positive in this case, but can also be negative (change Skills’s sign on Lang and you’ll see). Third, if we want an unbiased estimate we need to condition on the confounder (Skills in this case), to close the backdoor path \(\mathrm{Lang} \leftarrow \mathrm{Skills} \rightarrow \mathrm{Qual}\).
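The sign-flip claim can be checked quickly (a base-R sketch with lm() as a fast stand-in for brm(); same assumed coefficients as above, with Skills’s effect on Lang negated):

```r
set.seed(2)
n <- 1e5
skills <- rnorm(n)
lang <- -0.5 * skills + rnorm(n) # Skills -> Lang now has a negative sign
qual <- 0.3 * skills + 0.4 * lang + rnorm(n)

coef(lm(qual ~ lang))["lang"]          # biased downwards, approx. 0.28
coef(lm(qual ~ lang + skills))["lang"] # conditioning on Skills recovers 0.4
```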

If we conduct an observational study, and we don’t have access to Skills, i.e., it’s unmeasured for some reason, or perhaps unknown, then we are facing omitted variable bias. However, all is not lost. We can still argue about whether it’s likely that Skills would nullify the causal effect of Lang on Qual.

2.2 Accounting for the unknown

If we look at the DAG we introduced previously,

we see that there are two arrows going from Skills. First, we have \(\mathrm{Skills} \rightarrow \mathrm{Lang}\) and then we have \(\mathrm{Skills} \rightarrow \mathrm{Qual}\). These two edges jointly determine the confounding bias. To assess the effect of Skills on Lang, one often uses the Scaled Mean Difference (SMD).

The SMD is the confounder’s effect on the different groups in Lang (e.g., Java and Python). In math, it’s something like this:

\[\mathrm{SMD} = \frac{\overline{\mathrm{Skills}}_{\mathrm{Java}} - \overline{\mathrm{Skills}}_{\mathrm{Python}}}{\sigma_{\mathrm{Lang}}}\]

The difference in mean between Java and Python can then, by using SMD, be represented as a difference in \(\sigma\). For example, a \(3\sigma\), \(1\sigma\), or \(0.1\sigma\) difference between two categories (since we divide it by \(\sigma_{\mathrm{Lang}}\)).
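As a concrete illustration, the SMD can be computed from (hypothetical) group-level data; here we scale by the pooled standard deviation of Skills, which is one common convention:

```r
# SMD: difference in mean Skills between two Lang groups, in sigma units
set.seed(3)
skills_java   <- rnorm(200, mean = 1.0) # assumed group means, sd = 1
skills_python <- rnorm(200, mean = 0.2)
pooled_sd <- sqrt((var(skills_java) + var(skills_python)) / 2)
smd <- (mean(skills_java) - mean(skills_python)) / pooled_sd
smd # roughly a 0.8 sigma difference between the groups
```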

If we next look at the other arrow (\(\mathrm{Skills} \rightarrow \mathrm{Qual}\)), we don’t need to think about the SMD since Skills is affecting the outcome Qual and not the treatment Lang. Here we can simply estimate the effect as a regular \(\beta\) estimate that we’re used to, i.e., in our case the \(\beta_{\mathrm{Skills}}\) estimate of our confounder Skills. (One could for example say that \(\beta_{\mathrm{Skills}} = 3\) meaning that a 1-unit change in Skills would imply a 3-unit change in Qual.)

With these two concepts, the SMD and the \(\beta_{\mathrm{Skills}}\) estimate, we can now add assumptions to our analysis and argue how likely it is that they would hold. According to Lin, Psaty, and Kronmal (1998) we now have three options to choose between:

  • Make assumptions concerning \(\beta_{\mathrm{Skills}}\) and estimate how large SMD needs to be to cancel out the direct effect of Lang on Qual.
  • Make assumptions concerning SMD and estimate how large \(\beta_{\mathrm{Skills}}\) needs to be to cancel out the direct effect of Lang on Qual.
  • Finally, by specifying both SMD and \(\beta_{\mathrm{Skills}}\) we can investigate the number and size of confounders, \(n_c\), needed to cancel out the direct effect of Lang on Qual.

We will focus on the first two analyses, since they are more common.

2.2.1 Assumptions concerning \(\beta_{\mathrm{Skills}}\)

If we continue with the example, we can collect the true effect’s lower confidence interval (\(\approx 0.4\)) and assume \(\beta_{\mathrm{Skills}} = 1.5\), i.e., a 1-unit change in Skills would imply a 1.5-unit change in Qual:

# tidy() comes from broom.mixed, tip() from tipr, filter()/pull() from dplyr
tidy(with_skills, conf.int = TRUE) |>
  filter(term == "lang") |>
  pull(conf.low) |>
  tip(confounder_outcome_effect = 1.5)
## The observed effect (0.39) WOULD be tipped by 1 unmeasured confounder
## with the following specifications:
##   * estimated difference in scaled means between the unmeasured confounder
##     in the exposed population and unexposed population: -2.33
##   * estimated relationship between the unmeasured confounder and the outcome: 1.5
## # A tibble: 1 × 5
##   effect_adjusted effect_observed exposure_confounder_e…¹ confounder_outcome_e…²
## *           <dbl>           <dbl>                   <dbl>                  <dbl>
## 1               1           0.389                   -2.33                    1.5
## # ℹ abbreviated names: ¹​exposure_confounder_effect, ²​confounder_outcome_effect
## # ℹ 1 more variable: n_unmeasured_confounders <dbl>

then a hypothetical unmeasured continuous confounder with \(\beta_{\mathrm{Skills}} = 1.5\) and \(\mathrm{SMD} \approx -2.3\) would cancel out the effect (in our case \(0.4\)) of Lang on Qual. Hence, given our assumptions, an unmeasured confounder (Skills) for which a 1-unit change implies a 1.5-unit change in Qual would, et voilà, mean you would not have a Nature paper. Visualizing what this means might help.

Assume we have some quality measure we’re interested in distributed as \(\mathrm{Normal}(45,1.5)\) (this could for example be the average number of bugs per \(x\) lines of code introduced into a system by any language). The scaled difference that we received above (i.e., \(-2.3\)) can be transformed to a quality score by multiplying with \(\sigma = 1.5\) (our assumed dispersion), i.e., \(1.5 \cdot -2.3 \approx -3.45\). This implies that between two categories (Java and Python) we would need to see a difference of approximately \(3.45\) bugs in the quality score when implementing a feature. Is this unlikely? Perhaps; context matters.
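The back-of-the-envelope conversion above, as code:

```r
sigma_qual <- 1.5 # assumed dispersion of the quality measure
smd <- -2.3       # scaled mean difference reported by tip()
smd * sigma_qual  # difference needed on the outcome scale, approx. -3.45
```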

However, as an empirical researcher you might not want to bet your life on it if you look at the plot below; this is how large a difference there would be between the two languages. More importantly, you should present these results to your readers and either a) argue why it is unlikely that a confounder exists, or b) go measure the latent variable Skills carefully.

Figure 2.2: The difference between the languages on the outcome scale.

2.2.2 Assumptions concerning SMD

As previously explained, there are two arrows going from Skills (see below). In the previous section we focused on the arrow \(\mathrm{Skills} \rightarrow \mathrm{Qual}\), i.e., setting the \(\beta_{\mathrm{Skills}}\) estimate to analyze the size of SMD needed to cancel out the direct causal effect of Lang on Qual. Here we will now do the opposite, i.e., given an SMD, how large a \(\beta_{\mathrm{Skills}}\) is needed to cancel out the causal effect of \(\mathrm{Lang} \rightarrow \mathrm{Qual}\).

For the sake of completeness we’ll assume \(\mathrm{SMD} = -2.3\) to validate the result in the previous section, i.e., we should end up with \(\beta_{\mathrm{Skills}} \approx 1.5\).

tidy(with_skills, conf.int = TRUE) |>
  filter(term == "lang") |>
  pull(conf.low) |>
  tip(exposure_confounder_effect = -2.3)
## The observed effect (0.39) WOULD be tipped by 1 unmeasured confounder
## with the following specifications:
##   * estimated difference in scaled means between the unmeasured confounder
##     in the exposed population and unexposed population: -2.3
##   * estimated relationship between the unmeasured confounder and the outcome: 1.51
## # A tibble: 1 × 5
##   effect_adjusted effect_observed exposure_confounder_e…¹ confounder_outcome_e…²
## *           <dbl>           <dbl>                   <dbl>                  <dbl>
## 1               1           0.389                    -2.3                   1.51
## # ℹ abbreviated names: ¹​exposure_confounder_effect, ²​confounder_outcome_effect
## # ℹ 1 more variable: n_unmeasured_confounders <dbl>

Above we see that \(\beta_{\mathrm{Skills}} \approx 1.50\) (i.e., the confounder_outcome_effect), thus we have an indication that this works both ways. Given that we have introduced two distinct concepts used to assess unobserved/unknown confounders, we can now turn our attention to a more involved example.

3 A second, more involved, example


Figure 3.1: A larger, more realistic DAG taken from Feldt et al. (2023).

Feldt et al. (2023) extracted causal links from several primary studies and attempted to assemble disparate evidence into a Unified Directed Acyclic Graph (UDAG) as above. In the UDAG above the letters are, from left to right and from bottom to top, short for: \(BA\) Business Area, \(PL\) Programming Language, \(H\) Hardware, \(ST\) Software Type, \(PT\) Project Type, \(S\) Size of software, \(L\) Location, \(E\) Effort, \(O\) Organization type, and \(TS\) Team Size.

We’d like to know the direct effect of \(TS\) on \(E\) (both in black), i.e., Team Size’s direct effect on Effort. One would think it’s a straightforward thing to measure, but not taking into account confounders would lead to biased estimates, i.e., due to spurious associations.

The UDAG, as we will see, will also help us understand if a sensitivity analysis of unknown confounders is needed. In short, it allows a future study either to be more certain of not facing omitted variable bias from the start, i.e., the UDAG is likely correct, or forces us to collect variables that are currently unmeasured, i.e., the UDAG needs to be complemented with new variables.

The above UDAG is clearly a more involved example and contains all the elementary constructs one can find in a DAG, e.g., Organization (\(O\)) is a confounder, Hardware (\(H\)) is part of a pipe since \(BA \rightarrow H \rightarrow E\), and the lower part also contains an ancestor of the outcome, e.g., \(L\). Additionally, we also see that there are colliders (e.g., \(PL\)) in the UDAG. Let’s see what we should condition on by calculating the adjustment set for the UDAG:

# adjustmentSets() comes from the dagitty package
adjustmentSets(udag, effect = "direct")
## { O, S }

The adjustment set is \(\mathcal{A} = \{ O, S \}\). By conditioning on \(O\) and \(S\) (i.e., adjusting for \(O\) and \(S\)) we will be able to estimate the direct effect of \(TS\) on \(E\). But what if there are unknown confounders? It’s not unlikely given that the studies that contributed to the UDAG are all observational.

Let’s do this systematically and overlay numbers on the plot, from bottom to top, to make sure that we don’t miss anything.


Figure 3.2: The DAG where we have also marked in orange the adjustment set, and numbered all edges.

In the plot above we have the treatment and outcome in black, as before, but we have now also added the color orange to signify that these are variables we condition on.

If we look at the plot above, and take Edge #1 as an example, the question we’re asking is:

If there would be a confounder \(C\) between \(PL\) and \(O\), i.e. \(PL \leftarrow C \rightarrow O\), (in addition to the existing edge \(PL \leftarrow O\)), how would that affect our possibility to estimate the direct effect of \(TS\) on \(E\), given our adjustment set \(\mathcal{A}\)?

The potential answer to the above question would be one of three cases:

  1. It would not affect the estimate (n/a). This may happen because the chosen adjustment set already blocks the confounding effect of a confounder, \(C\).
  2. By conditioning on other variable(s) we close the path and then the confounder will not affect the estimate (co).
  3. We truly have a case of omitted variable bias, which means that we need to conduct a sensitivity analysis (ovb).

In Case 1 (n/a), a confounder would not affect our ability to obtain an unbiased causal estimate of the direct effect. In Case 2 (co), the confounder would affect our ability to get an unbiased estimate; however, if we condition on \(1,\ldots,n\) other variables, the confounder would no longer affect the results and would, thus, fall under the n/a category. Finally, Case 3 (ovb) is the one we are terrified of. Here a potential confounder would affect our ability to get an unbiased estimate. This means that we either need to add additional assumptions and argue that this confounder is unlikely to exist, or think harder, identify such a confounder, and then measure it carefully.

If we next analyze what it would imply if an unknown confounder truly existed at each of the above edges, while we condition on \(S\) and \(O\), we quickly see that there are only two edges that we need to be careful about: Edges #17 and #18.

Edge #17 is a bit tricky. If there were a confounder between \(S\) and \(TS\), i.e., \(S \leftarrow C \rightarrow TS\) (in addition to Edge #17), it would bias the estimate. However, in this particular case we don’t have to see it as a case of omitted variable bias (Case #3 above), since if we condition on \(BA\) and \(ST\), we would close those paths and still be able to get an unbiased estimate of the direct effect of \(TS\) on \(E\). In short, this is a case of co as explained above, which we reach by iteratively conducting another adjustment set analysis.

Edge #18, however, is more serious. A confounder between \(TS\) and \(E\), i.e., \(TS \leftarrow C \rightarrow E\), would bias our estimate. This is a clear case of omitted variable bias (i.e., ovb), since we can’t solve it by conditioning on any other variables. In short, if such an unknown/unmeasured confounder exists, we’d be toast.

Below is an updated UDAG where we mark all variables that we condition on in orange, the confounder is in yellow, while the outcome and treatment are in black. Other variables that we have no use for are now removed from the plot and further analysis. Already this step has clarified that there are a number of variables that we don’t have to collect.


Figure 3.3: The DAG after closing backdoor paths and adjusting for possible confounders. Note that an unmeasured confounder might still exist (in yellow).

Evidently, we see that we condition on more variables than \(S\) and \(O\) (which was declared as the minimal adjustment set). We also included \(BA\) and \(ST\) because we wanted to close backdoor paths to avoid any unmeasured or unknown confounders.

How did we come to the results above? Well, analyzing the UDAG according to causal calculus (Pearl 2009) leads to the above results, and with experience one develops intuition for analyzing DAGs. However, one can (should) also use software for assistance, e.g., the dagitty package in R. If you are so inclined you can verify online that our analysis above is correct.

To summarize, we have a challenge dealing with a confounder \(C\), which affects both our treatment \(TS\) and the outcome \(E\), i.e., \(TS \leftarrow C \rightarrow E\). Perhaps a researcher could swear that such a confounder cannot physically exist for the studied phenomenon, but if one is not 100% sure, then we should do better and fall back on sensitivity analysis of unknown confounders. Before we do that, we need to set some assumptions about the effect size we are expecting. How large a direct effect could we anticipate, and is it likely that we can estimate that effect given a certain sample size?

3.1 Establishing relevant effect sizes

In addition to the outcome \(E\) and the treatment \(TS\), there are several variables we’ve decided to control (adjust) for, i.e., \(\mathcal{A} = \{ S, O, ST, BA \}\) (the latter two so that any confounder between \(S\) and \(TS\) is cancelled out).

Ultimately, the case we have in front of us is the classical case of a confounder affecting both the treatment and the outcome.


Figure 3.4: The elementary DAG needing further analysis.

The first priority is to decide on any assumptions concerning the direct effect of \(TS\) on \(E\), e.g., what sample sizes we need in order to see ‘significant effects.’ The second priority is that, once we believe an effect can be found (given our sample size), we need to ascertain the effect a confounder must have on \(TS\) and \(E\) to cancel out that effect (i.e., to make it cross zero according to some credible interval).

Let’s start with the former priority. If we discover that we can’t find a relevant effect because we won’t be able to collect a large enough number of data points, then it might not be worthwhile to even design a study in the first place, i.e., a statistical power analysis in the traditional sense.
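The gist of such a power analysis can be sketched cheaply in base R before committing to the full Bayesian simulation below (lm() instead of brm(), and the effect sizes are assumptions):

```r
# For each sample size: simulate, fit, and check how often the 95% interval
# for beta_ts excludes zero (i.e., classical power)
set.seed(4)
b_cts <- 1.0; b_c <- 1.0; b_ts <- 0.3 # assumed medium direct effect
for (n in c(10, 20, 50, 100)) {
  ci <- replicate(200, {
    c  <- rnorm(n)                         # confounder
    ts <- rpois(n, pmax(5 + b_cts * c, 0)) # pmax() guards against negative rates
    e  <- rnorm(n, b_c * c + b_ts * ts)
    confint(lm(e ~ ts + c))["ts", ]
  })
  cat(n, "samples: share of CIs excluding zero =", mean(ci[1, ] > 0), "\n")
}
```

This frequentist shortcut is only a rough stand-in for the simulation that follows, but it is orders of magnitude faster.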

Let’s start by summarizing the characteristics of the three variables of interest:

  • Team Size (\(TS\)), treatment: \(\mathbb{N}^+\), a natural positive number indicating number of members in a team.
  • Effort (\(E\)), outcome: \(\mathbb{N}^+\), a natural positive number indicating hours spent in a project.
  • Confounder (\(C\)): Assume \(\mathbb{R}\), i.e., a \(\mathsf{Normal}\) distribution.

and then design a generative model according to the above characteristics and the DAG.

# This part only sets up our stat model so we can execute it in the sim()

# Just set some initial effects; we'll vary this in sim() anyway
b_cts <- 1.0
b_c <- 1.3
b_ts <- 0.1

# Our generative model
c <- rnorm(100)
ts <- rpois(100, 5 + b_cts * c) # Confounder, c, affects TeamSize, ts
e <- rnorm(100, b_c * c + b_ts * ts) # Outcome E is affected by ts and c!
df <- data.frame(C = c, TS = ts, E = e)

# Set weakly informative priors
# N(0,2) prior on intercept; still allows for absurd, α = {-5,5}, effects
# N(0,1) prior on β est; still allows for absurd, β = {-3,3}, effects
# Exponential(1) on σ; still allows for absurd, σ = 5, effects
# Run once so we can reuse model in sim()
m <- brm(E ~ C + TS, data = df, refresh = 0,
  prior = c(prior(normal(0, 1), class = b),
            prior(normal(0, 2), class = Intercept),
            prior(exponential(1), class = sigma))
)
# Simulate varying sample size `n` and `n_sim` with param combinations `g`
# n A vector of integers (sample sizes).
# n_sim An integer. Run `n_sim` times to capture stochasticity.
# g A data frame from `expand.grid()` with all combinations of β values.
# returns A list with one element per row of `g`; each element is an
# `n_sim` x `length(n)` matrix of estimates
sim <- function(n = 125, n_sim = 200,
                g = expand.grid(c(-1, 1), c(-1, 1), c(-0.1, 0.1))) {

  l <- list()

  for (i in seq_len(nrow(g))) {
    b_cts <- g[i, 1]
    b_c <- g[i, 2]
    b_ts <- g[i, 3]

    l[[i]] <- foreach(j = n, .combine = cbind) %:% # nolint
      foreach(k = seq_len(n_sim), .combine = c) %dopar% { # nolint
      c <- rnorm(j) # confounder # nolint
      ts <- rpois(j, 5 + b_cts * c) # TS effect is 5 # nolint
      e <- rnorm(j, b_c * c + b_ts * ts) # E affected by b_TS and b_C # nolint
      d <- data.frame(C = c, TS = ts, E = e)
      m <- update(m, newdata = d, refresh = 0)
      fixef(m)[3] # nolint
    }
  }
  return(l)
}
# Set up and run simulation

# Sample sizes used when simulating data
n <- seq(10, 100, 5)

# number of sim runs; thank you CLT...
n_sim <- 200

# For β_ts we need to consider an effect size. Generally, for a coefficient β,
# effect sizes between 0.10–0.29 are said to be small, effect sizes between
# 0.30–0.49 are medium, and effect sizes of 0.50 or greater are large. YMMV
#
# Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.;
# Erlbaum: Hillsdale, MI, USA, 1988; ISBN 0-8058-0283-5.
# Fey, C.F.; Hu, T.; Delios, A. The Measurement and Communication of Effect
# Sizes in Management Research. Manag. Organ. Rev. 2022, 1–22.
g <- expand.grid(c(-1.0, -0.5, 0.5, 1.0), # β_cts
                 c(-1.0, -0.5, 0.5, 1.0), # β_c
                 c(0.1, 0.3, 0.5)) # β_ts! Assume positive effect

# 48 effect combinations x 19 diff sizes of samples x 200 sims = 182,400
registerDoParallel(detectCores() - 1)

res <- sim(n, n_sim, g)

stopImplicitCluster()
# sanity check: one matrix of estimates per parameter combination
stopifnot(length(res) == nrow(g))

# create df instead of list
res_d <- data.frame(res)

# remove column names
names(res_d) <- NULL

# create a data frame d[912 x 10]
# n_id = the 19 sample sizes repeated {10,15,...,100} x 48
# s_id = each param setting (48) was repeated with 19 diff sample sizes
#   {1,1,...,1}, {2,2,...,2}, ..., {48,48,...,48}
# mu = mean of each param estimated by HMC
# low50 = 50% highest posterior density (low)
# high50 = 50% highest posterior density (high)
# low95 = 95% highest posterior density (low)
# high95 = 95% highest posterior density (high)
# b_cts = setting for β_cts parameter
# b_c = setting for β_c parameter (confounder)
# b_ts = setting for β_ts parameter (outcome)
d <- data.frame(
  n_id  = rep(n, length(res)),
  s_id  = rep(seq_along(res), each = ncol(res[[1]])),
  mu = apply(res_d, 2, mean),
  low50 = apply(res_d, 2, function(x) HPDI(x, prob = 0.5)[1]),
  high50 = apply(res_d, 2, function(x) HPDI(x, prob = 0.5)[2]),
  low95 = apply(res_d, 2, function(x) HPDI(x, prob = 0.95)[1]),
  high95 = apply(res_d, 2, function(x) HPDI(x, prob = 0.95)[2]),
  b_cts = rep(g[, 1], each = ncol(res[[1]])),
  b_c = rep(g[, 2], each = ncol(res[[1]])),
  b_ts = rep(g[, 3], each = ncol(res[[1]]))
)

s <- seq(1, 912, 19)

l <- list()

for (i in s) {
  j <- i + 18

  l[[i]] <- ggplot(data = d[i:j, ], aes(x = n_id, y = mu)) +
    geom_line() +
    geom_ribbon(aes(x = n_id, ymin = low50, ymax = high50), alpha = 0.15) +
    geom_ribbon(aes(x = n_id, ymin = low95, ymax = high95), alpha = 0.1) +
    geom_hline(yintercept = 0, linewidth = 0.3) +
    geom_hline(yintercept = d[i, 10], linewidth = 0.25,
               color = "#D55E00") +
    ylab(expression(beta[ts])) +
    xlab(as.expression(bquote("{" * beta[ts] ~ "=" ~
                                .(d[i, 10]) * ";" ~ beta[cts] ~ "=" ~
                                .(d[i, 8]) * ";" ~ beta[c] ~ "=" ~
                                .(d[i, 9]) * "}"))) +
    scale_x_continuous(breaks = seq(20, 100, 20)) +
    theme_tufte()
}

Figure 3.5: Main plot: Small effect


Figure 3.6: Main plot: Medium effect


Figure 3.7: Main plot: Large effect

So what is the point of the above plots? As one can see (open the top image in its own window so you can alternate with the text), we have fixed the true effect to \(\beta_{ts} = 0.1\) for all individual subplots (in the other two main plots we set \(\beta_{ts} = \{0.3, 0.5\}\)). The assumption is that we are expecting a small, medium, or large direct causal effect of \(TS\) on \(E\). For each of the main plots we then also vary the other effects (i.e., \(\beta_{c}\) and \(\beta_{cts}\)):

expand.grid(b_c = c(-1, -0.5, 0.5, 1), b_cts = c(-1, -0.5, 0.5, 1))
##     b_c b_cts
## 1  -1.0  -1.0
## 2  -0.5  -1.0
## 3   0.5  -1.0
## 4   1.0  -1.0
## 5  -1.0  -0.5
## 6  -0.5  -0.5
## 7   0.5  -0.5
## 8   1.0  -0.5
## 9  -1.0   0.5
## 10 -0.5   0.5
## 11  0.5   0.5
## 12  1.0   0.5
## 13 -1.0   1.0
## 14 -0.5   1.0
## 15  0.5   1.0
## 16  1.0   1.0

The \(y\)-axis in each subplot then depicts the estimated \(\beta_{ts}\) effect, while the \(x\)-axis depicts the sample sizes, i.e., in the simulation we used \(N=\{10, 15, \ldots, 100\}\). Having \(100\) companies participating in a study is perhaps not realistic, and many would say that already having more than \(20\) participating would be above expectations\(\ldots\)

Still focusing on the main top plot, in each individual subplot we see the estimate of \(\beta_{ts}\) (after \(200\) simulations), which is the somewhat wiggly black line moving around the red line. The red line is the ground truth, i.e., \(\beta_{ts} = 0.1\) in this case. The shaded areas are then the uncertainty around the estimated \(\beta_{ts}\). The darker shade is the \(50\)% highest posterior density estimate (you can see it as a confidence interval, but in this case it actually works the way you think). The lighter shade is the commonly used \(95\)%-level we see being used to claim ‘significance.’
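For intuition, the highest posterior density interval (computed with HPDI() from the rethinking package in the simulation code above) can be sketched in a few lines of base R. This simplified helper (a hypothetical implementation, not the package’s) finds the narrowest interval containing a given share of the samples:

```r
hpdi <- function(samples, prob = 0.5) {
  sorted <- sort(samples)
  n_keep <- ceiling(prob * length(sorted))
  # slide a window of n_keep consecutive sorted samples; keep the narrowest
  starts <- seq_len(length(sorted) - n_keep + 1)
  widths <- sorted[starts + n_keep - 1] - sorted[starts]
  i <- which.min(widths)
  c(sorted[i], sorted[i + n_keep - 1])
}

set.seed(5)
x <- rnorm(1e4)
hpdi(x, prob = 0.95) # close to c(-1.96, 1.96) for a standard normal
```

For symmetric posteriors the HPDI essentially coincides with the usual percentile interval; for skewed posteriors it is narrower.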

If we first take a helicopter view of all subplots we see a few things immediately.

  1. We can recover the ground truth. The wiggly black line converges around the red line quickly as we increase the sample size. That, in itself, is an indication that our model represents the underlying data generation process well (which is not strange in itself since the generative model and the inference model are similar).
  2. All subplots are very similar. This is the only reason for plotting all these plots, i.e., to visualize all the combinations and see the similarity. In short, we don’t have to worry about the other effects affecting the direct causal effect of interest, no matter the sign or magnitude of the other effects. It’s reassuring, but given causal calculus, it should not be surprising.
  3. All subplots show (wiggly black line) that the estimate is not close to the ground truth (red line) in small sample sizes. However, already when we get close to \(20\)\(30\) samples the estimate starts to stabilize around the ground truth. Thank you Central Limit Theorem.
  4. All subplots start with much uncertainty around the estimate due to small sample sizes. If we examine the top main plot we see that the shade (light or even the dark) crosses zero very much at the beginning, i.e., this is an effect we would have a hard time to uncover due to a small sample size. When the direct causal effect is small and we have small sample sizes, chances of us finding a significant result is slim to none. If we find an effect it will simply be pure luck.
  5. Even when we have large sample sizes it’s questionable if we would uncover a small effect, i.e., \(\beta_{ts} = 0.1\). This is indicated by the shaded area almost always crossing zero, even at large sample sizes.

Before looking at the middle main plot, let’s first look at the third (last) main plot, i.e., where \(\beta_{ts}=0.5\). It’s evident that since the effect is so large (i.e., \(0.5\) instead of \(0.1\)) we keep clear of \(0\) even at the smallest sample size, i.e., \(N=10\). However, what we should ask ourselves is: How likely is it that the phenomenon we are studying would show such a strong effect? Perhaps not very likely…

Let’s now look at the middle plot, i.e., \(\beta_{ts} = 0.3\). Here it’s evident that once we come up to \(N \approx 20\) we are in the clear. We will be able to uncover such an effect size, since the uncertainty (shaded area) does not cross zero.

All the above indicates strongly that if we expect a small effect, which is common in much science, then we might as well forget about designing a study and save us, and any participating companies, time and resources. However, if we expect a large effect, then we could get away with a very small sample size, i.e., perhaps even \(N=10\). But, is it likely that there exists such a strong effect? Perhaps not. Finally, a more realistic case would be a medium effect, if there is an effect at all. We can actually uncover such an effect if we move up to \(N \approx 20\).

Nonetheless, all is not joy. As researchers we work under certain assumptions. One such assumption in our case is that our DAG is a correct representation of reality. What if it’s not? We need to ascertain the effect a confounder must have on \(TS\) and \(E\) to cancel out that effect (i.e., to make it cross zero according to some credible interval):

  • Make assumptions concerning the confounder (which is unknown or unmeasured) and estimate how large \(\mathrm{SMD}\) needs to be to cancel out the direct effect (\(0.3\)) of \(TS\) on \(E\).
  • Make assumptions concerning \(\mathrm{SMD}\) and estimate how large the confounder’s effect (the confounder being unknown or unmeasured) needs to be to cancel out the direct effect (\(0.3\)) of \(TS\) on \(E\).

Let’s do that.

3.2 Sensitivity analysis of unknown confounders

We have come to the conclusion that a medium-sized effect (\(\beta_{ts} = 0.3\)) could possibly be discovered given the assumptions we made (assuming, of course, that we can collect the sample size needed). But, recall, an unmeasured confounder \(C\) could exist between the treatment \(TS\) and the outcome \(E\).

Figure 3.8: Elementary DAG with assumptions.

To move forward, we need to make assumptions about the effect an unknown confounder, \(C\), has on the outcome \(E\) (i.e., \(\beta_{ce}\)), or about how the confounder affects the treatment \(TS\), expressed through the scaled mean difference (\(\mathrm{SMD}\)).

A first step could be to step through \(\beta_{ce}\) values, e.g., given \(\beta_{ts} = 0.1\), and see what this implies for \(\mathrm{SMD}\).

Figure 3.9: The interplay between the effect of a confounder on the outcome (\(\beta_{ce}\)) and the scaled mean difference (\(\mathrm{SMD}\)) of the confounder’s effect on the treatment.

The vertical axis shows \(\mathrm{SMD}\), while the horizontal axis shows the effect the confounder has on the outcome (\(\beta_{ce}\)). What we see is that the closer \(\beta_{ce}\) comes to zero, the higher (or lower) \(\mathrm{SMD}\) needs to be to cancel out (or tip) the estimated effect. This is logical: if \(\beta_{ce}\) is very small, a large (positive or negative) scaled mean difference is needed to compensate and cancel out the estimated effect.

Let’s simply step through some effects that might be relevant for us, i.e., \(\beta_{ce}\) effects between \(0.1\) and \(0.5\), and plot them to see what this implies concerning \(\mathrm{SMD}\). We will also plot several hypothetical \(\beta_{ts}\) effects, not only the assumed \(\beta_{ts} = 0.3\).
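The curve family in the figure can also be tabulated directly. Assuming the linear sensitivity approximation, where the bias a confounder induces is roughly \(\beta_{ce} \cdot \mathrm{SMD}\) (consistent with the purple-circle example, where \(\beta_{ce} = 0.2\) cancels \(\beta_{ts} = 0.3\) at \(\mathrm{SMD} = 1.5\)), the tipping value is \(\mathrm{SMD} = \beta_{ts} / \beta_{ce}\). A minimal sketch:

```python
# Tipping-point SMD for a grid of confounder-on-outcome effects
# (beta_ce) and hypothetical treatment effects (beta_ts), under the
# assumed linear approximation: bias ~= beta_ce * SMD, so the
# estimate is cancelled out when SMD = beta_ts / beta_ce.
beta_ce_grid = [0.1, 0.2, 0.3, 0.4, 0.5]
beta_ts_grid = [0.1, 0.2, 0.3, 0.4, 0.5]

header = "beta_ts \\ beta_ce"
print(f"{header:>18}", *(f"{bce:>6}" for bce in beta_ce_grid))
for bts in beta_ts_grid:
    row = (f"{bts / bce:>6.2f}" for bce in beta_ce_grid)
    print(f"{bts:>18}", *row)
```

Reading along a row shows why small \(\beta_{ce}\) values are dangerous: the required \(\mathrm{SMD}\) blows up as \(\beta_{ce}\) shrinks, while large \(\beta_{ce}\) values squeeze all curves toward small, easily reached \(\mathrm{SMD}\) values.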

Figure 3.10: The relationship between \(\beta_{ce}\) and \(\mathrm{SMD}\).

Focus first on the main plot and not the inset. What the plot conveys is that if the confounder’s effect on the outcome is, e.g., \(\beta_{ce} = 0.2\), then the effect the treatment has on the outcome, i.e., the blue line representing \(\beta_{ts} = 0.3\), would be cancelled out by \(\mathrm{SMD} = 1.5\) (indicated by the purple circle). (The inset shows the inverse, i.e., assuming that \(\beta_{ce}\) is negative.)

First, Morasca and Russo (2001) presented some estimates of effort where \(\sigma = [0.35, 7.2]\); a very large dispersion, indicating much uncertainty. If we assume \(\sigma = 0.35\), it would imply that a change in the outcome would be \(\mathrm{SMD} \cdot \sigma = 1.5 \cdot 0.35 \approx 0.5\) units, while \(\sigma = 7.2\) would indicate roughly \(11\) units. That is an order of magnitude in difference. However, these values concern actual estimates of effort, i.e., the outcome, and are not necessarily representative of the effect a confounder would have on the treatment (which then affects the outcome). We would assume that effect to be smaller. If not, we would have failed to account for a confounder that has a stronger effect on the treatment than the effect the treatment has on the outcome, which is not likely to happen.
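The arithmetic above is straightforward to check (a sketch; \(\mathrm{SMD} = 1.5\) is the tipping value from the purple-circle example):

```python
# Implied tipping change in the outcome, in original units, for
# SMD = 1.5 and the two dispersion estimates (sigma = 0.35 and
# sigma = 7.2) reported by Morasca and Russo (2001).
smd = 1.5
for sigma in (0.35, 7.2):
    print(f"sigma = {sigma}: change = {smd * sigma:.2f} units")
```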

Second, we are fairly certain that \(\beta_{ts}\) will not be a large effect size. Earlier, we talked about \(0.3\) as a realistic value, and noted that we would not expect a larger effect size than that. If we look at the plot above, perhaps an even more realistic (conservative) view is to assume \(\beta_{ts} = 0.2\) (the black line). If that were the case, one can see that an even smaller \(\mathrm{SMD}\) would tip the results, i.e., there’s an increased risk that we miss a confounder with such a small \(\mathrm{SMD}\). One could summarize this with an analogy from physics: when effects are small, we approach a singularity where everything breaks down due to uncertainty.

Third, the opposite also holds, i.e., if we move to the right on the horizontal axis, it is evident that all effect sizes for \(\beta_{ts}\) are squished together the further to the right we move (i.e., the larger the \(\beta_{ce}\) estimates are). The lesson here is that if we expect larger effect sizes, \(>0.5\), for the confounder’s effect on the outcome, one could make a convincing argument that the risk of such an effect being cancelled out would be almost negligible. In short, we would be aware of such a confounder, if we’d done our homework.

4 Concluding remarks

So what has this analysis given us? Hopefully a deeper understanding that one needs to do much more upfront analysis before designing and executing a study. In short, we can argue, from a probabilistic viewpoint, whether it’s worth doing a study or not.

Initially we presented the concept of confounders. Our hope is that researchers start thinking about confounders more often and take them seriously. Much software engineering research today, we suspect, does not.

After having introduced the concept of confounders, we then provided a larger, more realistic, example. The example is built around a causal systematic review. One of the aims of the causal systematic review was to indicate areas where further research is needed, but also to synthesize the current knowledge into a directed acyclic graph (DAG). The DAG was used as input in our analysis, and as an example we assumed that there is an outcome (\(E\)) and a treatment (\(TS\)) we were interested in investigating further. However, a more complex DAG, such as the one in Fig. 3.1, required us to be careful and, first, decide on a minimal adjustment set to obtain an unbiased estimate of the direct causal effect of \(TS\) on \(E\), i.e., \(\mathcal{A} = \{O, S\}\).

Next, we analyzed the DAG to see which other backdoor paths we could close, and which nodes we could condition on to minimize the risk of having an unknown/unmeasured confounder affecting our analysis, further growing our adjustment set to \(\mathcal{A} = \{O,S,ST,BA\}\).

Finally, we faced the conundrum that a potential confounder \(C\) could still affect the estimate, since it could potentially affect both the treatment and the outcome. To deal with this, a number of assumptions had to be established. Assumptions were made concerning:

  • the type of variables we were dealing with (e.g., \(\mathbb{N}^+\) and \(\mathbb{R}\)).
  • effect sizes (where we exhaustively tried all variations for realistic effect sizes).
  • sample sizes \(N\) (where we tried several levels, i.e., \(10, 15, \ldots, 100\)).

The analysis showed that we can:

  • recover the ground truth, and
  • trust the estimates when \(N \approx 20\).

The last part of the analysis examined the trade-off between \(\mathrm{SMD}\) (the effect a confounder has on the treatment) and \(\beta_{ce}\) (the effect a confounder has on the outcome), given the treatment’s effect size on the outcome.

To conclude, if one can collect data from \(20+\) companies, and if we assume that a confounder’s effect on the outcome is \(\beta_{ce} \leq 0.3\) or the scaled mean difference is \(\mathrm{SMD} \leq 1.5\), we are safe. We argue that it’s fairly safe to assume that we would know of such a confounder.

References

Feldt, R., M. Shepperd, C. A. Furia, and R. Torkar. 2023. “Causal Systematic Reviews—Synthesizing Scientific Knowledge Through Causal Links.” arXiv.
Lin, D. Y., B. M. Psaty, and R. A. Kronmal. 1998. “Assessing the Sensitivity of Regression Results to Unmeasured Confounders in Observational Studies.” Biometrics 54 (3): 948–63.
Morasca, S., and G. Russo. 2001. “An Empirical Study of Software Productivity.” In 25th Annual International Computer Software and Applications Conference. COMPSAC 2001, 317–22. https://doi.org/10.1109/CMPSAC.2001.960633.
Pearl, J. 2009. Causality: Models, Reasoning and Inference. 2nd ed. Cambridge University Press.